Abstract

The correspondence problem in computer vision is basically a matching task between two or more sets 
of features. Computing feature correspondence is of great importance in computer vision, especially in 
the subfields of object recognition, stereo, and motion. In this paper, we introduce a vectorized image 
representation, which is a feature-based representation where correspondence has been established with 
respect to a reference image. The representation consists of two image measurements made at the fea- 
ture points: shape and texture. Feature geometry, or shape, is represented using the (x,y) locations of 
features relative to some standard reference shape. Image grey levels, or texture, are represented by
mapping image grey levels onto the standard reference shape. Computing this representation is essentially 
a correspondence task, and in this paper we explore an automatic technique for "vectorizing" face images. 
Our face vectorizer alternates back and forth between computation steps for shape and texture, and a 
key idea is to structure the two computations so that each one uses the output of the other. Namely, the 
texture computation uses shape for geometrical normalization, and the shape computation uses the tex- 
ture analysis to synthesize a "reference" image for finding correspondences. A hierarchical coarse-to-fine 
implementation is discussed, and applications are presented to the problems of facial feature detection 
and registration of two arbitrary faces. 
1 Introduction

The computation of correspondence is of great impor- 
tance in computer vision, especially in the subfields of 
object recognition, stereo, and motion. The correspon- 
dence problem is basically a matching task between two 
or more sets of features. In the case of object recogni- 
tion, one set of features comes from a prior object model 
and the other from an image of the object. In stereo and 
motion, the correspondence problem involves matching 
features across different images of the object, where the 
images may be taken from different viewpoints or over 
time as the object moves. Common feature points are 
often taken to be salient points along object contours 
such as corners or vertices. 

A common representation for objects in recognition, 
stereo, and motion systems is feature-based; object at- 
tributes are recorded at a set of feature points. The 
set of feature points can be situated in either 3D as an 
object-centered model or in 2D as a view-centered de- 
scription. To capture object geometry, one of the object 
attributes recorded at each feature is its position in 2D 
or 3D. Additionally, if the object has a detailed texture,
one may be interested in recording the local surface
albedo at each feature point or more simply the image 
brightness. Throughout this paper we refer to these two 
attributes respectively as shape and texture. 

Given two or more sets of features, correspondence 
algorithms match features across the feature sets. We 
define a vectorized representation to be a feature- 
based representation where correspondence has been es- 
tablished relative to a fixed reference object or reference 
image. Computing the vectorized representation can be 
thought of as arranging the feature sets into ordered vec- 
tors so that the ith element of each vector refers to the 
same feature point for all objects. Given the correspon- 
dences in the vectorized representation, subsequent pro- 
cessing can do things like register images to models for 
recognition, and estimate object depth or motion. 

In this paper, we introduce an algorithm for comput- 
ing the vectorized representation for a class of objects 
like the human face. Faces present an interesting class 
of objects because of the variation seen across individu- 
als in both shape and texture. The intricate structure of 
faces leads us to use a dense set of features to describe it. 
Once a dense set of feature correspondences has been
computed between an arbitrary face and a "reference" 
face, applications such as face recognition and pose and 
expression estimation are possible. However, the focus of 
this paper is on an algorithm for computing a vectorized 
representation for faces. 

The two primary components of the vectorized rep- 
resentation are shape and texture. Previous approaches 
in analyzing faces have stressed either one component or 
the other, such as feature localization or decomposing 
texture as a linear combination of eigenfaces (see Turk 
and Pentland [37]). The key aspect of our vectorization 
algorithm, or "vectorizer", is that the two processes for
the analysis of shape and texture are coupled. That is, 
the shape and texture processes are coupled by mak- 
ing each process use the output of the other. The tex- 
ture analysis uses shape for geometrical normalization, 



and shape analysis uses texture to synthesize a refer- 
ence image for feature correspondence. Empirically, we 
have found that this links the two processes in a positive 
feedback loop. Iterating between the shape and texture 
steps causes the vectorized representation to converge 
after several iterations. 

Our vectorizer is similar to the active shape model 
of Cootes, et al. [17] [16] [23] in that both iteratively fit 
a shape/texture model to the input. But there are in- 
teresting differences in the modeling of both shape and 
texture. In our vectorizer there is no model for shape; it 
is measured in a data-driven manner using optical flow. 
In active shape models, shape is modeled using a parametric,
example-based method. First, an ensemble of shapes is
processed using principal component analysis, which
produces a set of "eigenshapes". New shapes are then
written as linear combinations of these eigenshapes.
Texture modeling in their approach, however, is weaker
than in ours. Texture is only modeled locally along 1D
contours at each of the feature points defining
shape. Our approach models texture over larger regions 
- such as eyes, nose, and mouth templates - which should 
provide more constraint for textural analysis. In the fu- 
ture we intend to add a model for shape similar to active 
shape models, as discussed ahead in section 6.2. 

In this paper, we start in section 2 by first providing a 
more concrete definition of our vectorized shape and tex- 
ture representation. This is followed by a more detailed 
description of the coupling of shape and texture. Next, 
in section 3, we present the basic vectorization method 
in more detail. Section 4 discusses a hierarchical coarse- 
to-fine implementation of the technique. In section 5, 
we demonstrate two applications of the vectorizer, facial 
feature detection and the registration of two arbitrary 
faces. The latter application is used to map prototypical 
face transformations onto a face so that new "virtual" 
views can be synthesized (see Beymer and Poggio [11]). 
The paper closes with suggestions for future work, in- 
cluding an idea to generalize the vectorizer to multiple 
poses. 

2 Preliminaries 

2.1 Vectorized representation 

As mentioned in the introduction, the vectorized repre- 
sentation is a feature-based representation where corre- 
spondence has been established relative to a fixed ref- 
erence object or reference image. Computationally, this 
requires locating a set of features on an object and bring- 
ing them into correspondence with some prior reference 
feature set. While it is possible to define a 3D, object- 
centered vectorization, the vectorized representation in 
this paper will be based on 2D frontal views of
the face. Thus, the representations for shape and tex- 
ture of faces will be defined in 2D and measured relative 
to a 2D reference image. 

Since the representation is relative to a 2D reference, 
first we define a standard feature geometry for the ref- 
erence image. The features on new faces will then be 
measured relative to the standard geometry. In this pa- 
per, the standard geometry for frontal views of faces is 




Figure 1: To define the shape of the prototypes off-line,
manual line segment features are used. After Beier and
Neely [5].

Figure 2: Manually defined shapes are averaged to compute
the standard face shape.



defined by averaging a set of line segment features over 
an ensemble of "prototype" faces. Fig. 1 shows the line 
segment features for a particular individual, and Fig. 2 
shows the average over a set of 14 prototype people. Fea- 
tures are assigned a text label (e.g. "ci") so that corre- 
sponding line segments can be paired across images. As 
we will explain later in section 3.1, the line segment fea- 
tures are specified manually in an initial off-line step that 
defines the standard feature geometry. 

The two components of the vectorized representation, 
shape and texture, can now be defined relative to this 
standard shape. 

2.1.1 Shape 

Given the locations of n feature points $f_1, f_2, \ldots, f_n$
in an image $i_a$, an "absolute" measure of 2D shape is
represented by a vector $y_a$ of length 2n consisting of the
concatenation of the x and y coordinate values of the
feature points. This absolute representation for 2D shape has been
widely used, including network-based object recogni- 
tion (Poggio and Edelman [28]), the linear combinations 
approach to recognition (Ullman and Basri [38], Pog- 
gio [27]), active shape models (Cootes and Taylor [15], 



Figure 3: Our vectorized representation for image $i_a$
with respect to the reference image $i_{std}$ at standard
shape. First, pixelwise correspondence is computed between
$i_{std}$ and $i_a$, as indicated by the grey arrow. Shape
$y^{std}_{a-std}$ is a vector field that specifies a corresponding
pixel in $i_a$ for each pixel in $i_{std}$. Texture $t_a$ consists of
the grey levels of $i_a$ mapped onto the standard shape.



Cootes, et al. [17]) and face recognition (Craw and 
Cameron [18][19]). 

A relative shape measured with respect to a standard
reference shape $y_{std}$ is simply the difference

$$y_a - y_{std},$$

which we denote using the shorthand notation $y_{a-std}$.
The relative shape $y_{a-std}$ is the difference in shape
between the individual in $i_a$ and the mean face shape.
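
As a minimal sketch of these two definitions, the absolute shape vector and the relative shape can be formed as follows. The interleaved ordering $(x_1, y_1, \ldots, x_n, y_n)$, the function names, and the feature coordinates are assumptions made for illustration only.

```python
import numpy as np

def shape_vector(points):
    """Concatenate n (x, y) feature locations into a vector of length 2n.

    The interleaved ordering (x1, y1, ..., xn, yn) is an assumption.
    """
    pts = np.asarray(points, dtype=float)
    return pts.reshape(-1)

def relative_shape(points_a, points_std):
    """Relative shape: absolute shape minus the standard reference shape."""
    return shape_vector(points_a) - shape_vector(points_std)

# Hypothetical locations of three corresponding features.
face_a = [(10.0, 20.0), (30.0, 22.0), (20.0, 40.0)]
standard = [(12.0, 20.0), (28.0, 20.0), (20.0, 38.0)]
y_rel = relative_shape(face_a, standard)
```

The subtraction is only meaningful because the ith entry of both vectors refers to the same feature, which is exactly the correspondence requirement of the vectorized representation.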

To facilitate shape and texture operators in the run-time
vectorization procedure, shape is spatially oversampled.
That is, we use a pixelwise representation for shape,
defining a feature point at each pixel in a subimage
containing the face. The shape vector $y_{a-std}$ can
then be visualized as a vector field of correspondences
between a face at standard shape and the given image $i_a$
being represented. If there are n pixels in the face
subimage being vectorized, then the shape vector consists of
2n values, a $(\Delta x, \Delta y)$ pair for each pixel. In this
dense, pixelwise representation for shape, we need to keep
track of the reference image, so the notation is extended to
include the reference as a superscript, $y^{std}_{a-std}$. Fig. 3
shows the shape representation $y^{std}_{a-std}$ for the image
$i_a$. As indicated by the grey arrow, correspondences are
measured relative to the reference face $i_{std}$ at standard
shape. (Image $i_{std}$ in this case is the mean grey level
image; modeling grey level texture is discussed more in
section 3.1.) Overall, the advantage of using a dense
representation is that it allows a simple optical flow
calculation to be used for computing shape and a simple 2D
warping operator for geometrical normalization.

2.1.2 Texture 

Our texture vector is a geometrically normalized version
of the image $i_a$. That is, the geometrical differences
among face images are factored out by warping the images
to the standard reference shape. This strategy for
representing texture has been used, for example, in the
face recognition works of Craw and Cameron [18], and
Shackleton and Welsh [33]. If we let shape $y_{std}$ be the
reference shape, then the geometrically normalized image
$t_a$ is given by the 2D warp

$$t_a(x, y) = i_a(x + \Delta x^{std}_{a-std}(x, y),\; y + \Delta y^{std}_{a-std}(x, y)),$$

where $\Delta x^{std}_{a-std}$ and $\Delta y^{std}_{a-std}$ are the x and y
components of $y^{std}_{a-std}$, the pixelwise mapping between $y_a$
and the standard shape $y_{std}$. Fig. 3 in the lower right shows an
example texture vector $t_a$ for the input image $i_a$ in the
upper right.
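
A backward warp of this kind can be sketched in a few lines. The version below assumes the dense shape is stored as two arrays `dx` and `dy` on the reference pixel grid, uses bilinear interpolation, and clamps samples to the image border; the function name and these choices are illustrative assumptions, not a description of the implementation used in our experiments.

```python
import numpy as np

def backward_warp(image, dx, dy):
    """Sample image at (x + dx, y + dy) for each pixel (x, y) of the reference grid."""
    h, w = dx.shape
    ys, xs = np.mgrid[0:h, 0:w].astype(float)
    # Source coordinates, clamped to the image border (an assumption).
    sx = np.clip(xs + dx, 0, image.shape[1] - 1)
    sy = np.clip(ys + dy, 0, image.shape[0] - 1)
    x0 = np.floor(sx).astype(int)
    y0 = np.floor(sy).astype(int)
    x1 = np.minimum(x0 + 1, image.shape[1] - 1)
    y1 = np.minimum(y0 + 1, image.shape[0] - 1)
    fx = sx - x0
    fy = sy - y0
    # Bilinear interpolation between the four neighbouring pixels.
    top = (1 - fx) * image[y0, x0] + fx * image[y0, x1]
    bottom = (1 - fx) * image[y1, x0] + fx * image[y1, x1]
    return (1 - fy) * top + fy * bottom
```

Because the warp is driven from the reference grid, every pixel of the normalized texture receives a value, which would not be guaranteed by pushing input pixels forward.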

If shape is sparsely defined, then texture mapping 
or sparse data interpolation techniques can be em- 
ployed to create the necessary pixelwise level representa- 
tion. Example sparse data interpolation techniques in- 
clude using splines (Litwinowicz and Williams [24], Wol- 
berg [40]), radial basis functions (Reisfeld, Arad, and 
Yeshurun [31]), and inverse weighted distance metrics 
(Beier and Neely [5]). If a pixelwise representation is 
being used for shape in the first place, such as one de- 
rived from optical flow, then texture mapping or data 
interpolation techniques can be avoided. 

2.1.3 Separation of shape and texture 

How cleanly have we separated the notions of shape 
and texture in the 2D representations just described? 
Ideally, the ultimate shape description would be a 3D 
one where the (x, y, z) coordinates are represented. Tex- 
ture would be a description of local surface albedo at 
each feature point on the object. Such descriptions are 
common for the modeling of 3D objects for computer 
graphics, and it would be nice for vision algorithms to 
invert the imaging or "rendering" process from 3D mod- 
els to 2D images. 

What our 2D vectorized description has done, how- 
ever, is to factor out and explicitly represent the salient 
aspects of 2D shape. The true spatial density of this 
2D representation depends, of course, on the density of 
features defining standard shape, shown in our case in 
Fig. 2. Some aspects of 2D shape, such as lip or eyebrow 
thickness, will end up being encoded in our model for 
texture. However, one could extend the standard fea- 
ture set to include more features around the mouth and 
eyebrows if desired. For texture, there are non-albedo 
factors confounded in the texture component, such as 
lighting conditions and the z-component of shape. Over- 
all, though, remember that only one view of the object 
being vectorized is available, thus limiting our access to 
3D information. We hope that the current definitions of 




Figure 4: Vectorizing face images: if we know who the
person is and have prior example views $i_a$ of their face,
then we can manually warp $i_a$ to standard shape, producing
a reference $t_a$. New images of the person can be
vectorized by computing optical flow between $t_a$ and the
new input. However, if we do not have prior knowledge
of the person being vectorized, we can still synthesize an
approximation to $t_a$, $\hat{t}_a$, by taking a linear combination
of prototype textures.



shape and texture are a reasonable approximation to the 
desired decomposition. 

2.2 Shape/texture coupling 

One of the main results of this paper is that the com- 
putations for the shape and texture components can be 
algorithmically coupled. That is, shape can be used to 
geometrically normalize the input image prior to texture 
analysis. Likewise, the result of texture analysis can be 
used to synthesize a reference image for finding corre- 
spondences in the shape computation. The result is an 
iterative algorithm for vectorizing images of faces. Let 
us now explore the coupling of shape and texture in more 
detail. 

2.2.1 Shape perspective 

Since the vectorized representation is determined by 
an ordered set of feature points, computing the represen- 
tation is essentially a feature finding or correspondence 
task. Consider this correspondence task under a special 
set of circumstances: we know who the person is, and we 
have prior example views of that person. In this case, a 
simple correspondence finding algorithm such as optical 
flow should suffice. As shown in the left two images of 
Fig. 4, first a prior example $i_a$ of the person's face is
manually warped in an off-line step to standard shape,
producing a reference image $t_a$. A new image of the same
person can now be vectorized simply by running an optical
flow algorithm between the image and reference $t_a$.

If we have no prior knowledge of the person being 
vectorized, the correspondence problem becomes more 
difficult. In order to handle the variability seen in facial 
appearance across different people, one could imagine us- 
ing many different example reference images that have 
been pre-warped to the standard reference shape. These 
reference images could be chosen, for example, by run- 
ning a clustering algorithm on a large ensemble of exam- 
ple face images. This solution, however, introduces the 
problem of having to choose among the reference images 
for the final vectorization, perhaps based on a confidence 
measure in the correspondence algorithm. 



Going one step further, in this paper we use a statis- 
tical model for facial texture in order to assist the corre- 
spondence process. Our texture model relies on the as- 
sumption, commonly made in the eigenface approach to 
face recognition and detection (Turk and Pentland [37], 
Pentland, et al. [26]), that the space of grey level images 
of faces is linearly spanned by a set of example views. 
That is, the geometrically normalized texture vector $t_a$
from the input image $i_a$ can be approximated as a linear
combination of n prototype textures $t_{p_j}$, $1 \le j \le n$,

$$\hat{t}_a = \sum_{j=1}^{n} \beta_j t_{p_j}, \qquad (1)$$

where the $t_{p_j}$ are themselves geometrically normalized
by warping them to the standard reference shape. The
rightmost image of Fig. 4, for example, shows an
approximation $\hat{t}_a$ that is generated by taking a linear
combination of textures as in equation (1). If the
vectorization procedure can estimate a proper set of $\beta_j$
coefficients, then computing correspondences should be
simple. Since the computed "reference" image $\hat{t}_a$
approximates the texture $t_a$ of the input and is geometrically
normalized, we are back to the situation where a simple
correspondence algorithm like optical flow should work.
In addition, the linear $\beta_j$ coefficients act as a low
dimensional code for representing the texture vector $t_a$.

This raises the question of computing the $\beta_j$ coefficients
for the texture model. Let us now consider the
vectorization procedure from the perspective of model- 
ing texture. 

2.2.2 Texture perspective 

To develop the vectorization technique from the texture
perspective, consider the simple eigenimage, or
"eigenface", model for the space of grey level face images.
The eigenface approach for modeling face images has 
been used recently for a variety of facial analysis tasks, 
including face recognition (Turk and Pentland [37], Aka- 
matsu, et al. [2], Pentland, et al. [26]), reconstruction 
(Kirby and Sirovich [22]), face detection (Sung and Pog- 
gio [35], Moghaddam and Pentland [25]), and facial fea- 
ture detection (Pentland, et al. [26]). The main assump- 
tion behind this modeling approach is that the space of 
grey level images of faces is linearly spanned by a set of 
example face images. To optimally represent this "face 
space", principal component analysis is applied to the
example set, extracting an orthogonal set of eigenimages 
that define the dimensions of face space. Arbitrary faces 
are then represented by the set of coefficients computed 
by projecting the face onto the set of eigenimages. 

One requirement on face images, both for the exam- 
ple set fed to principal components and for new images 
projected onto face space, is that they be geometrically 
normalized so that facial features line up across all im- 
ages. Most normalization methods use a global trans- 
form, usually a similarity or affine transform, to align 
two or three major facial features. For example, in Pent- 
land, et al. [26], the imaging apparatus effectively regis- 
ters eyes, and Akamatsu, et al. [2] register the eyes and 
mouth. 



However, because of the inherent variability of facial 
geometries across different people, aligning just a couple 
of features - such as the eyes - leaves other features mis- 
aligned. To the extent that some features are misaligned, 
even this normalized representation will confound differ- 
ences in grey level information with differences in local 
facial geometry. This may limit the representation's gen- 
eralization ability to new faces outside the original ex- 
ample set used for principal components. For example, a 
new face may match the texture of one particular linear 
combination of eigenimages but the shape may require 
another linear combination. 

To decouple texture and shape, Craw and Cameron [18] and
Shackleton and Welsh [33] represent shape separately and use
it to geometrically normalize face texture by deforming
it to a standard shape. Shape is defined by the (x,y)
locations of a set of feature points, as in our definition
for shape. In Craw and Cameron [18], 76 points outlining
the eyes, nose, mouth, eyebrows, and head are used.
To geometrically normalize texture using shape, image
texture is deformed to a standard face shape, making
it "shape free". This is done by first triangulating the
image using the features and then texture mapping.

However, they did not demonstrate an effective automatic
method for computing the vectorized shape/texture
representation. This is mainly due to difficulties in
finding correspondences for shape, where probably on the
order of tens of features need to be located.
Craw and Cameron [18] manually locate their features.
Shackleton and Welsh [33], who focus on eye images, use
the deformable template approach of Yuille, Cohen, and
Hallinan [41] to locate eye features. However, for 19/60
of their example eye images, feature localization is
rated as either "poor" or "no fit".

Note that in both of these approaches, computation of
the shape and texture components has been separated,
with shape being computed first. This differs from our 
approach, where shape and texture computations are in- 
terleaved in an iterative fashion. In their approach the 
link from shape to texture is present - using shape to 
geometrically normalize the input. But using a texture 
model to assist finding correspondences is not exploited. 

2.2.3 Combining shape and texture 

Our face vectorizer consists of two primary steps, a
shape step that computes vectorized shape $y^{std}_{a-std}$ and
a texture step that uses the texture model to approximate
the texture vector $t_a$. Key to our vectorization
procedure is linking the two steps in a mutually beneficial
manner and iterating back and forth between the
two until the representation converges. First, consider
how the result of the texture step can be used to assist
the shape step. Assuming for the moment that the
texture step can provide an estimate $\hat{t}_a$ using equation
(1), then the shape step estimates $y^{std}_{a-std}$ by computing
optical flow between the input and $\hat{t}_a$.

Next, to complete the loop between shape and texture,
consider how the shape $y^{std}_{a-std}$ can be used to compute
the texture approximation $\hat{t}_a$. The shape $y^{std}_{a-std}$ is
used to geometrically normalize the input image using
the backward warp

$$t_a(\mathbf{x}) = i_a(\mathbf{x} + y^{std}_{a-std}(\mathbf{x})),$$

where $\mathbf{x} = (x, y)$ is a 2D pixel location in standard shape.
This normalization step aligns the facial features in the
input image with those in the textures $t_{p_j}$. Thus, when
$t_a$ is approximated in the texture step by projecting it
onto the linear space spanned by the $t_{p_j}$, facial features
are properly registered.

Given initial conditions for shape and texture, our 
proposed system switches back and forth between tex- 
ture and shape computations until a stable solution is 
found. Because of the manner in which the shape and 
texture computations feed back on each other, improv- 
ing one component improves the other: better corre- 
spondences mean better feature alignment for textural 
analysis, and computing a better textural approximation 
improves the reference image used for finding correspon- 
dences. Empirically, we have found that the representa- 
tion converges after several iterations. 
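
The alternation described above can be summarized in the following sketch. It is only an outline under stated assumptions: the optical flow routine `flow_fn` and the warp `warp_fn` are hypothetical callables supplied by the caller, the eigenimages are assumed orthonormal, the mean texture is taken as the initial reference, and a fixed iteration count stands in for a convergence test.

```python
import numpy as np

def vectorize(image, t_mean, eigenimages, flow_fn, warp_fn, n_iters=5):
    """Alternate shape and texture steps (illustrative outline only).

    eigenimages: (N, h, w) array, assumed orthonormal when flattened.
    flow_fn(reference, image) -> (dx, dy): hypothetical optical flow routine.
    warp_fn(image, dx, dy) -> image warped backward to standard shape.
    """
    E = eigenimages.reshape(len(eigenimages), -1)
    t_hat = t_mean.copy()  # initial reference: the mean texture (assumption)
    for _ in range(n_iters):
        # Shape step: correspondences from the current reference to the input.
        dx, dy = flow_fn(t_hat, image)
        # Geometric normalization of the input using the current shape.
        t_a = warp_fn(image, dx, dy)
        # Texture step: project onto the eigenimages and reconstruct.
        beta = E @ (t_a - t_mean).ravel()
        t_hat = t_mean + (beta @ E).reshape(t_mean.shape)
    return (dx, dy), beta, t_hat
```

Each pass uses the reference synthesized by the previous texture step, so an improvement in either component is passed on to the other at the next iteration.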

Now that we have seen a general outline of our vec- 
torizer, let us explore the details. 

3 Basic Vectorization Method 

The basic method for our vectorizer breaks down into 
two main parts, the off-line preparation of the example 
textures $t_{p_j}$, and the on-line vectorization procedure
applied to a new input image.

3.1 Off-line preparation of examples 

The basic assumption made in modeling vectorized tex- 
ture is that the space of face textures is linearly spanned 
by a set of geometrically normalized example face tex- 
tures. Thus, in constructing a vectorizer we must first 
collect a group of representative faces that will define 
face space, the space of the textural component in our 
representation. Before using the example faces in the 
vectorizer, they are geometrically normalized to align 
facial features, and the grey levels are processed using 
principal components or the pseudoinverse to optimize 
run-time textural processing. 

3.1.1 Geometric normalization 

To geometrically normalize an example face, we ap- 
ply a local deformation to the image to warp the face 
shape into a standard geometry. This local deformation 
requires both the shape of the example face as well as 
some definition of the standard shape. Thus, our off-line 
normalization procedure needs the face shape component 
for our example faces, something we provide manually. 
These manual correspondences are averaged to define the 
standard shape. Finally, a 2D warping operation is ap- 
plied to do the normalization. We now go over these 
steps in more detail. 

First, to define the shape of the example faces, a set of
line segment features is positioned manually for each.
The features, shown in Fig. 1, follow Beier and Neely's [5] 
manual correspondence technique for morphing face im- 
ages. Pairing up image feature points into line segments 
gives one a natural control over local scale and rotation 




Figure 5: Examples of off-line geometrical normalization 
of example images. Texture for the normalized images is 
sampled from the original images - that is why the chin 
is generated for the second example. 



in the eventual deformation to standard shape, as we will 
explain later when discussing the deformation technique. 

Next, we average the line segments over the example 
images to define the standard face shape (see Fig. 2). 
We don't have to use averaging - since we are creating 
a definition, we could have just chosen a particular ex- 
ample face. However, averaging shape should minimize 
the total amount of distortion required in the next step 
of geometrical normalization. 

Finally, images are geometrically normalized using the 
local deformation technique of Beier and Neely [5]. This 
deformation technique is driven by the pairing of line 
segments in the example image with line segments in 
the standard shape. Consider a single pairing of line 
segments, one segment from the example image, $l_{ex}$, and
one from the standard shape, $l_{std}$. This line segment
pair essentially sets up a local transform from the region
surrounding $l_{ex}$ to the region surrounding $l_{std}$. The local
transform resembles a similarity transform except that
there is no scaling perpendicular to the segment, just
scaling along it. The local transforms are computed for
each segment pair, and the overall warping is taken as
a weighted average. Some examples of images before and
after normalization are shown in Fig. 5. 
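
The per-segment transform and the weighted average can be sketched as follows for a single pixel location. The sketch follows the general form of Beier and Neely's field warping; the function names, the representation of a segment as a pair of endpoints, and the weighting parameter values `a`, `b`, and `p` are illustrative assumptions.

```python
import numpy as np

def perp(v):
    """Vector perpendicular to v with the same length."""
    return np.array([-v[1], v[0]])

def map_point(x, dst_lines, src_lines, a=1.0, b=2.0, p=0.5):
    """Map point x in the standard shape to a location in the example image.

    Each line is a pair (P, Q) of 2D endpoints; dst_lines lie in the standard
    shape and src_lines in the example image. a, b, p are weighting
    parameters with illustrative values.
    """
    x = np.asarray(x, dtype=float)
    total = np.zeros(2)
    weight_sum = 0.0
    for (P, Q), (Ps, Qs) in zip(dst_lines, src_lines):
        P, Q, Ps, Qs = (np.asarray(v, dtype=float) for v in (P, Q, Ps, Qs))
        d, ds = Q - P, Qs - Ps
        length = np.linalg.norm(d)
        u = np.dot(x - P, d) / length**2      # fraction along the segment
        v = np.dot(x - P, perp(d)) / length   # signed distance from the line
        # Position along the segment scales with u; the perpendicular
        # distance v is carried over unscaled.
        xs = Ps + u * ds + v * perp(ds) / np.linalg.norm(ds)
        if u < 0:
            dist = np.linalg.norm(x - P)
        elif u > 1:
            dist = np.linalg.norm(x - Q)
        else:
            dist = abs(v)
        w = (length**p / (a + dist)) ** b     # nearer, longer lines weigh more
        total += w * (xs - x)
        weight_sum += w
    return x + total / weight_sum
```

Evaluating `map_point` at every pixel of the standard shape gives the sampling locations for a backward warp of the example image.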

3.1.2 Texture processing 

Now that the example faces have been normalized for 
shape, they can be used for texture modeling. Given a 
new input $i_a$, the texture analysis step tries to approximate
the input texture $t_a$ as a linear combination of
the example textures. Of course, given a linear subspace 
such as our face space, one can choose among different 
sets of basis vectors that will span the same subspace. 
One popular method for choosing the basis set, the eigen- 
image approach, applies principal components analysis 
to the example set. Another potential basis set is simply 
the original set of images themselves. We now discuss 
the off-line texture processing required for the two basis 
sets of principal components and the original images. 

Principal components analysis is a classical technique 
for reducing the dimensionality of a cluster of data 



points, where the data are assumed to be distributed 
in an ellipsoid pattern about a cluster center. If there is 
correlation in the data among the coordinate axes, then 
one can project the data points to a lower dimensional 
subspace without losing information. This corresponds 
to an ellipsoid with interesting variation along a num- 
ber of directions that is less than the dimensionality of 
the data points. Principal components analysis finds the 
lower dimensional subspace inherent in the data points. 
It works by finding a set of directions $e_i$ such that the
variance in the data points is highest when projected
onto those directions. These $e_i$ directions are computed
by finding the eigenvectors of the covariance matrix
of the data points.

In our ellipsoid of n geometrically normalized textures
$t_{p_j}$, let $t'_{p_j}$ be the set of textures with the mean $t_{mean}$
subtracted off

$$t_{mean} = \frac{1}{n} \sum_{j=1}^{n} t_{p_j}$$

$$t'_{p_j} = t_{p_j} - t_{mean}, \qquad 1 \le j \le n.$$

If we let T be a matrix where the jth column is $t'_{p_j}$

$$T = [\, t'_{p_1} \;\; t'_{p_2} \;\; \cdots \;\; t'_{p_n} \,],$$

then the covariance matrix is defined as

$$\Sigma = T T^t.$$

Notice that T is an $m \times n$ matrix, where m is the number of
pixels in vectorized texture vectors. Due to our pixelwise
representation for shape, $m \gg n$ and thus $\Sigma$, which is
an $m \times m$ matrix, is quite large and may be intractable
for eigenanalysis. Fortunately, one can solve the smaller
eigenvector problem for the $n \times n$ matrix $T^t T$. This is
possible because an eigenvector $e_i$ of $T^t T$

$$T^t T e_i = \lambda_i e_i$$

corresponds to an eigenvector $T e_i$ of $\Sigma$. This can be
seen by multiplying both sides of the above equation by
matrix T

$$(T T^t)\, T e_i = \lambda_i\, T e_i.$$

Since the eigenvectors (or eigenimages) $e_i$ with the larger
eigenvalues $\lambda_i$ explain the most variance in the example
set, only a fraction of the eigenimages need to be retained 
for the basis set. In our implementation, we chose to use 
roughly half the eigenimages. Fig. 6 shows the mean face 
and the first 6 eigenimages from a principal components 
analysis applied to a group of 55 people. 
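
The reduction to the smaller eigenvector problem can be sketched as below. The function name, the row-per-texture layout, and the explicit normalization of the resulting eigenimages are assumptions for illustration; the number of retained eigenimages should stay below n, since mean subtraction leaves at most n - 1 nonzero eigenvalues.

```python
import numpy as np

def eigenimages(textures, n_keep):
    """textures: (n, m) array, one geometrically normalized texture per row."""
    t_mean = textures.mean(axis=0)
    T = (textures - t_mean).T                # m x n, mean-subtracted columns
    lam, e = np.linalg.eigh(T.T @ T)         # small n x n symmetric problem
    order = np.argsort(lam)[::-1][:n_keep]   # largest eigenvalues first
    lam, e = lam[order], e[:, order]
    basis = T @ e                            # m x n_keep eigenvectors of T T^t
    basis /= np.linalg.norm(basis, axis=0)   # scale each column to unit length
    return t_mean, basis, lam
```

Only the n x n matrix is ever decomposed, so the cost is governed by the number of example faces rather than the number of pixels.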

Since the eigenimages are orthogonal (and can easily 
be normalized to be made orthonormal), analysis and re- 
construction of new image textures during vectorization 
can be easily performed. Say that we retain N eigenimages,
and let $t_a$ be a geometrically normalized texture
to analyze. Then the run-time vectorization procedure
projects $t_a$ onto the $e_i$ and can reconstruct $t_a$,
yielding $\hat{t}_a$.

Figure 6: Mean image and eigenimages from applying
principal components analysis to the geometrically
normalized examples.
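
One way to write the analysis and reconstruction steps, assuming orthonormal eigenimages stored as the columns of a matrix and assuming the mean texture is subtracted before projection, is $\beta_i = e_i \cdot (t_a - t_{mean})$ followed by $\hat{t}_a = t_{mean} + \sum_{i=1}^{N} \beta_i e_i$. The function names below are hypothetical.

```python
import numpy as np

def analyze(t_a, t_mean, basis):
    """Coefficients of t_a on orthonormal eigenimages (columns of basis)."""
    return basis.T @ (t_a - t_mean)

def reconstruct(beta, t_mean, basis):
    """Texture approximation built from the coefficients."""
    return t_mean + basis @ beta
```

With an orthonormal basis no matrix inversion is needed, and the part of the input left unexplained by the reconstruction is orthogonal to every retained eigenimage.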



Another potential basis set is the original example 
textures themselves. That is, we approximate $t_a$ by a
linear combination of the $n$ original image textures $t_{p_i}$

$$\hat{t}_a = \sum_{i=1}^{n} \beta_i t_{p_i}. \qquad (4)$$

While we do not need to solve this equation until on-line
vectorization, previewing the solution will elucidate
what needs to be done for off-line processing. Write
equation (4) in matrix form

$$\hat{t}_a = T\beta, \qquad (5)$$

where $\hat{t}_a$ is written as a column vector, $T$ is a matrix
whose $i$th column is $t_{p_i}$, and $\beta$ is a column vector of
the $\beta_i$'s. Solving this with linear least squares yields

$$\beta = T^\dagger t_a \qquad (6)$$

$$\phantom{\beta} = (T^\top T)^{-1} T^\top t_a \qquad (7)$$

where $T^\dagger = (T^\top T)^{-1} T^\top$ is the pseudoinverse of $T$. The
pseudoinverse can be computed off-line since it depends
only on the example textures $t_{p_i}$. Thus, run-time vectorization
performs texture analysis with the columns of
$(T^\dagger)^\top$, that is, the rows of $T^\dagger$ (equation (6)), and
reconstruction with the columns of $T$ (equation (5)). Fig. 7
shows some example images
processed by the pseudoinverse where $n$ was 40.
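
The off-line/on-line split for the example-texture basis can be sketched as below. This is an illustration under assumptions: NumPy is used, random vectors stand in for the example textures, and the variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(2)
m, n = 300, 6
T = rng.standard_normal((m, n))       # column i stands in for example texture t_{p_i}
t_a = rng.standard_normal(m)          # stand-in geometrically normalized texture

T_pinv = np.linalg.pinv(T)            # off-line: depends only on the examples
beta = T_pinv @ t_a                   # on-line analysis, equation (6)
t_hat = T @ beta                      # on-line reconstruction, equation (5)

# Cross-check against the normal-equations form of equation (7).
beta_normal = np.linalg.solve(T.T @ T, T.T @ t_a)
```

Both routes give the same coefficients when the example textures are linearly independent; `np.linalg.pinv` is the more robust choice when two examples are nearly identical.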

Note that for both basis sets, the linear coefficients are 
computed using a simple projection operation. Coding-wise
at run-time, the only difference is whether one subtracts
off the mean image $t_{mean}$. In practice though,
the eigenimage approach will require fewer projections 
since not all eigenimages are retained. Also, the orthog- 
onality of the eigenimages may produce a more stable 
set of linear coefficients - consider what happens for the 
pseudoinverse approach when two example images are 
similar in texture. Yet another potential basis set, one 
that has the advantage of orthogonality, would be the 
result of applying Gram-Schmidt orthonormalization to 
the example set. 

Most of our vectorization experiments have been with 
the eigenimage basis, so the notation in the next section 
uses this basis set. 

3.2 Run-time vectorization 

In this section we go over the details of the vectorization 
procedure. The inputs to the vectorizer are an image $i_a$




Figure 7: Example textures processed by the pseudoinverse
$T^\dagger = (T^\top T)^{-1} T^\top$. When using the original set of
image textures as a basis, texture analysis is performed 
by projection onto these images. 



to vectorize and a texture model consisting of $N$ eigenimages
$e_i$ and mean image $t_{mean}$. In addition, the vectorizer
takes as input a planar transform $P$ that selects
the face region from the image $i_a$ and normalizes it for
the effects of scale and image-plane rotation. The planar
transform $P$ can be a rough estimate from a coarse
scale analysis. Since the faces in our test images were
taken against a solid background, face detection is relatively
easy and can be handled simply by correlating
with a couple of face templates. The vectorization procedure
refines the estimate $P$, so the final outputs of the
procedure are the vectorized shape $y^{std}_{a-std}$, a set of $\beta_i$
coefficients for computing $\hat{t}_a$, and a refined estimate of
$P$.

As mentioned previously, the interconnectedness of 
the shape and texture steps makes the iteration con- 
verge. Fig. 8 depicts the convergence of the vectoriza- 
tion procedure from the perspective of texture. There 
are three sets of face images in the figure, sets of (1) all 
face images, (2) geometrically normalized face textures, 
and (3) the space of our texture model. The difference 
between the texture model space and the set of geomet- 
rically normalized faces depends on the prototype set of 
n example faces. The larger and more varied this set be- 
comes, the smaller the difference becomes between sets 
(2) and (3). Here we assume that the texture model is 
not perfect, so the true $t_a$ is slightly outside the texture
model space.

The goal of the iteration is to make the estimates of $t_a$
and $\hat{t}_a$ converge to the true $t_a$. The path for $t_a$, the
geometrically normalized version of $i_a$, is shown by the
curve from $i_a$ to the final $t_a$. The path for $\hat{t}_a$ is shown
by the curve from initial $\hat{t}_a$ to final $\hat{t}_a$. The texture and
shape steps are depicted by the arrows jumping between
the curves. The texture step, using the latest estimate of
shape to produce $t_a$, projects $t_a$ into the texture model
space. The shape step uses the latest $\hat{t}_a$ to find a new
set of correspondences, thus updating shape and hence
$t_a$. As one moves along the $t_a$ curve, one is getting
better estimates of shape. As one moves along the $\hat{t}_a$
curve, the $\beta_i$ coefficients in the texture model improve.
Since the true $t_a$ lies outside the texture model space,
the iteration stops at final $\hat{t}_a$. This error can be made
smaller by increasing the number of prototypes for the
texture model.

We now look at one iteration step in detail. 

3.2.1 One iteration 

In examining one iteration of the texture and shape 
steps, we assume that the previous iteration has provided
an estimate for $y^{std}_{a-std}$ and the $\beta_i$ coefficients. For
the first iteration, an initial condition of $y^{std}_{a-std} = \vec{0}$ is
used. No initial condition is needed for texture since the
iteration starts with the texture step.

In the texture step, first the input image $i_a$ is geometrically
normalized using the shape estimate $y^{std}_{a-std}$,
producing $t_a$

$$t_a(\mathbf{x}) = i_a(\mathbf{x} + y^{std}_{a-std}(\mathbf{x})), \qquad (8)$$

where $\mathbf{x} = (x, y)$ is a pixel location in the standard shape.
This is implemented as a backwards warp using the flow
vectors pointing from the standard shape to the input.
Bilinear interpolation is used to sample $i_a$ at non-integral
$(x, y)$ locations. Next, $t_a$ is projected onto the eigenimages
$e_i$ using equation (2) to update the linear coefficients
$\beta_i$. These updated coefficients should enable the
shape computation to synthesize an approximation $\hat{t}_a$
that is closer to the true $t_a$.
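
The backwards warp with bilinear interpolation described above can be sketched as follows. This is a minimal NumPy illustration, not the original C implementation; the function name `backward_warp`, the `(dx, dy)` flow layout, and the clamping of samples at the image border are all assumptions.

```python
import numpy as np

def backward_warp(image, flow):
    """Sample image at x + flow(x) for every pixel x of the standard shape.

    image: (H_in, W_in) grey-level array.
    flow:  (H, W, 2) array holding (dx, dy) per standard-shape pixel.
    Samples falling outside the image are clamped to the border (assumption).
    """
    h, w = flow.shape[:2]
    ys, xs = np.mgrid[0:h, 0:w].astype(float)
    sx = np.clip(xs + flow[..., 0], 0, image.shape[1] - 1)
    sy = np.clip(ys + flow[..., 1], 0, image.shape[0] - 1)
    x0 = np.floor(sx).astype(int)
    y0 = np.floor(sy).astype(int)
    x1 = np.minimum(x0 + 1, image.shape[1] - 1)
    y1 = np.minimum(y0 + 1, image.shape[0] - 1)
    fx, fy = sx - x0, sy - y0                      # fractional offsets
    top = (1 - fx) * image[y0, x0] + fx * image[y0, x1]
    bottom = (1 - fx) * image[y1, x0] + fx * image[y1, x1]
    return (1 - fy) * top + fy * bottom
```

With a zero flow field the warp returns the input unchanged, which matches the initial condition used in the first iteration.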

In the shape step, first a reference image $\hat{t}_a$ is synthesized
from the texture coefficients using equation (3).
Since the reference image reconstructs the texture of the
input, it should be well suited for finding shape correspondences.
Next, optical flow is computed between $\hat{t}_a$,
which is geometrically normalized, and $i_a$, which updates
the pixelwise correspondences $y^{std}_{a-std}$. For optical flow,
we used the gradient-based hierarchical scheme of Bergen
and Adelson [7], Bergen and Hingorani [9], and Bergen,
et al. [8]. The new correspondences should provide better
geometrical normalization in the next texture step.

Overall, iterating these steps until the representation
stabilizes is equivalent to iteratively solving for the
$y^{std}_{a-std}$ and $\beta_i$ which best satisfy $t_a = \hat{t}_a$, that is,

$$i_a(\mathbf{x} + y^{std}_{a-std}(\mathbf{x})) = t_{mean}(\mathbf{x}) + \sum_{i=1}^{N} \beta_i e_i(\mathbf{x}).$$

3.2.2 Adding a global transform

We introduce a planar transform $P$ to select the image
region containing the face and to normalize the face for
the effects of scale and image-plane rotation. Let $i'_a$ be
the input image $i_a$ resampled under the planar transform
$P$

$$i'_a(\mathbf{x}) = i_a(P(\mathbf{x})). \qquad (9)$$

It is this resampled image $i'_a$ that will be geometrically
normalized in the texture step and used for optical flow
in the shape step.

Besides selecting the face, the transform P will also be 
used for selecting subimages around individual features 
such as the eyes, nose, and mouth. As will be explained

Figure 8: Convergence of the vectorization procedure with regard to texture. The texture and shape steps try to
make $t_a$ and $\hat{t}_a$ converge to the true $t_a$.



in the next section on our hierarchical implementation, 
the vectorization procedure is applied in a coarse-to-fine 
strategy on a pyramid structure. Full face templates are 
vectorized at the coarser scales and individual feature 
templates are vectorized at the finer scales. 
Transform $P$ will be a similarity transform

$$P(\mathbf{x}) = s \begin{pmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{pmatrix} \mathbf{x} + \begin{pmatrix} t_x \\ t_y \end{pmatrix},$$

where the scale $s$, image-plane rotation $\theta$, and 2D translation
$(t_x, t_y)$ are determined in one of two ways, depending
on the region being vectorized.

1. Two point correspondences. Define anchor points
$q_{std,1}$ and $q_{std,2}$ in standard shape, which can be
done manually in off-line processing. Let $q_{a,1}$ and
$q_{a,2}$ be estimates of the anchor point locations in
the image $i_a$, estimates which need to be made
on-line. The similarity transform parameters are
then determined such that

$$P(q_{std,1}) = q_{a,1}, \quad P(q_{std,2}) = q_{a,2}. \qquad (10)$$

This uses the full flexibility of the similarity transform
and is used when the image region being vectorized
contains two reliable feature points such as
the eyes.

2. Fixed $s$, $\theta$, and one point correspondence. In this
case there is only one anchor point $q_{std,1}$, and one
solves for $t_x$ and $t_y$ such that

$$P(q_{std,1}) = q_{a,1}. \qquad (11)$$

This is useful for vectorizing templates with less
reliable features such as the nose and mouth. For
these templates the eyes are vectorized first and
used to fix the scale and rotation for the nose and
mouth.
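
The two ways of determining the similarity transform parameters can be sketched as below. This is an illustration only: it represents 2D points as complex numbers so that $P(z) = s e^{i\theta} z + t$, and the function names are hypothetical rather than taken from the original implementation.

```python
import numpy as np

def similarity_from_two_points(q_std1, q_std2, q_a1, q_a2):
    """Case 1: solve for s, theta, (t_x, t_y) from two anchor correspondences."""
    zs1, zs2 = complex(*q_std1), complex(*q_std2)
    za1, za2 = complex(*q_a1), complex(*q_a2)
    a = (za2 - za1) / (zs2 - zs1)          # a = s * exp(i * theta)
    b = za1 - a * zs1                      # translation as a complex number
    return abs(a), np.angle(a), (b.real, b.imag)

def translation_from_one_point(s, theta, q_std1, q_a1):
    """Case 2: s and theta are fixed; solve only for (t_x, t_y)."""
    a = s * np.exp(1j * theta)
    b = complex(*q_a1) - a * complex(*q_std1)
    return (b.real, b.imag)

def apply_similarity(s, theta, t, q):
    """Map a standard-shape point q through P."""
    z = s * np.exp(1j * theta) * complex(*q) + complex(*t)
    return (z.real, z.imag)
```

Two correspondences give four equations for the four unknowns $s$, $\theta$, $t_x$, $t_y$, while one correspondence with fixed $s$ and $\theta$ gives two equations for the two translation unknowns.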



While the vectorizer assumes that a face finder has 
provided an initial estimate for P, we would like the 
vectorizer to be insensitive to a coarse or noisy estimate 
and to improve the estimate of P during vectorization. 
The similarity transform $P$ can be updated during the
iteration when our estimates change for the positions of
the anchor points $q_{a,i}$. This can be determined after
the shape step computes a new estimate of the shape
$y^{std}_{a-std}$. We can tell that an anchor point estimate is off
when there is nonzero flow at the anchor point

$$\| y^{std}_{a-std}(q_{std,i}) \| > \text{threshold}.$$

The correspondences can be used to update the anchor
point estimate

$$q_{a,i} = P(q_{std,i} + y^{std}_{a-std}(q_{std,i})).$$

Next, $P$ can be updated using the new anchor point locations
using equation (10) or (11) and $i_a$ can be resampled
again using equation (9) to produce a new $i'_a$.

3.2.3 Entire procedure 

The basic vectorization procedure is now summarized. 
Lines 2(a) and (b) are the texture step, lines 2(c) and (d) 
are the shape step, and line 2(e) updates the similarity 
transform P. 

procedure vectorize 

1. initialization 

(a) Estimate P using a face detector. For exam- 
ple, a correlational face finder using averaged 
face templates can be used to estimate the 
translational component of P. 

(b) Resample $i_a$ using the similarity transform $P$,
producing $i'_a$ (equation (9)).

(c) $y^{std}_{a-std} = \vec{0}$.

2. iteration: solve for $y^{std}_{a-std}$, $\beta_i$, and $P$ by iterating
the following steps until the $\beta_i$ stop changing.

(a) Geometrically normalize $i'_a$ using $y^{std}_{a-std}$, producing
$t_a$

$$t_a(\mathbf{x}) = i'_a(\mathbf{x} + y^{std}_{a-std}(\mathbf{x})).$$

(b) Project $t_a$ onto the eigenimages $e_i$, computing the
linear coefficients $\beta_i$

$$\beta_i = e_i \cdot (t_a - t_{mean}), \quad 1 \le i \le N.$$

(c) Compute reference image $\hat{t}_a$ for correspondence
by reconstructing the geometrically
normalized input

$$\hat{t}_a = t_{mean} + \sum_{i=1}^{N} \beta_i e_i.$$

(d) Compute the shape component using optical
flow

$$y^{std}_{a-std} = \text{optical-flow}(i'_a, \hat{t}_a).$$

(e) If the anchor points are misaligned, as indicated
by optical flow, then:

i. Update $P$ with new anchor points,
ii. Resample $i_a$ using the similarity transform
$P$, producing $i'_a$ (eqn (9)),
iii. $y^{std}_{a-std} = \text{optical-flow}(i'_a, \hat{t}_a)$.
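
Steps 2(a) through 2(d) of the procedure can be sketched as a loop. This is a simplified illustration under several assumptions: the warp and the optical flow routine are passed in as callables (`warp` and `optical_flow` are hypothetical placeholders, not a real library API), the eigenimages are stored as orthonormal columns of `E`, convergence is tested with an assumed tolerance `tol`, and the update of $P$ in step 2(e) is omitted.

```python
import numpy as np

def vectorize(i_a_prime, E, t_mean, warp, optical_flow, max_iter=10, tol=1e-3):
    """Alternate texture and shape steps until the coefficients stop changing.

    i_a_prime:    resampled input image, same size as the standard shape here.
    E:            (h*w, N) matrix whose columns are orthonormal eigenimages.
    t_mean:       (h, w) mean texture.
    warp:         callable (image, flow) -> geometrically normalized image.
    optical_flow: callable (image, reference) -> (h, w, 2) flow field.
    """
    h, w = t_mean.shape
    y = np.zeros((h, w, 2))                       # initial shape estimate
    beta = np.zeros(E.shape[1])
    for _ in range(max_iter):
        t_a = warp(i_a_prime, y)                              # step 2(a)
        new_beta = E.T @ (t_a - t_mean).ravel()               # step 2(b)
        t_hat = t_mean + (E @ new_beta).reshape(h, w)         # step 2(c)
        y = optical_flow(i_a_prime, t_hat)                    # step 2(d)
        converged = np.linalg.norm(new_beta - beta) < tol
        beta = new_beta
        if converged:
            break
    return y, beta, t_hat
```

Passing the warp and flow routines as arguments keeps the sketch independent of any particular optical flow implementation; a real system would also need the anchor-point check of step 2(e).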



Fig. 9 shows snapshot images of $i'_a$, $t_a$, and $\hat{t}_a$ during
each iteration of an example vectorization. The iteration
number is shown in the left column, and the starting input
is shown in the upper left. We deliberately provided
a poor initial alignment for the iteration to demonstrate
the procedure's ability to estimate the similarity transform
$P$. As the iteration proceeds, notice how (1) improvements
in $P$ lead to a better global alignment in $i'_a$,
(2) the geometrically normalized image $t_a$ improves, and
(3) the image $\hat{t}_a$ becomes a more faithful reproduction
of the input. The additional row for $i'_a$ is given because
when step 2(e) is executed in the last iteration, $i'_a$ is
updated.

3.3 Pose dependence from the example set 

The example images we have used in the vectorizer so 
far have been from a frontal pose. What about other 
poses, poses involving rotations out of the image plane? 
Because we are being careful about geometry and cor- 
respondence, the example views used to construct the 
vectorizer must be taken from the same out-of-plane im- 
age rotation. The resulting vectorizer will be tuned to 
that pose, and performance is expected to drop as an 
input view deviates from that pose. The only thing that 
makes the vectorizer pose-dependent, however, is the set 
of example views used to construct face space. The it- 
eration step is general and should work for a variety of 
poses. Thus, even though we have chosen a frontal view 
as an example case, a vectorizer tuned for a different 
pose can be constructed simply by using example views 
from that pose.

Figure 9: Snapshot images of $i'_a$, $t_a$, and $\hat{t}_a$ during the
three iterations of an example vectorization. See text for
details.



In section 5.1 on applying the vectorizer to feature de- 
tection, we demonstrate two vectorizers, one tuned for 
a frontal pose, and one for an off-frontal pose. Later, 
in section 6.3, we suggest a multiple-pose vectorizer that 
connects different pose-specific vectorizers through inter- 
polation. 

4 Hierarchical implementation 

For optimization purposes, the vectorization procedure 
is implemented using a coarse-to-fine strategy. Given 
an input image to vectorize, first the Gaussian pyramid 
(Burt and Adelson [14]) is computed to provide a mul- 
tiresolution representation over 4 scales, the original im- 
age plus 3 reductions by 2. A face finder is then run 
over the coarsest level to provide an initial estimate for 
the similarity transform P. Next, the vectorizer is run 
at each pyramid level, working from the coarser to finer 
levels. As processing moves from a coarser level to a 
finer one, the coarse shape correspondences are used to 
initialize the similarity transform P for the vectorizer at 
the finer level. 
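
The four-level pyramid described above can be sketched as follows. This is a minimal NumPy illustration, assuming a separable 5-tap binomial kernel and reflective border handling; the kernel weights and border mode used in the original implementation are not stated in the text, and the function names are hypothetical.

```python
import numpy as np

KERNEL = np.array([1, 4, 6, 4, 1], dtype=float) / 16.0   # assumed 5-tap kernel

def blur(image):
    """Separable 5-tap low-pass filter with reflected borders."""
    h, w = image.shape
    padded = np.pad(image, 2, mode="reflect")
    # Filter along x, then along y.
    rows = sum(k * padded[:, i:i + w] for i, k in enumerate(KERNEL))
    return sum(k * rows[i:i + h, :] for i, k in enumerate(KERNEL))

def gaussian_pyramid(image, levels=4):
    """Original image plus (levels - 1) successive reductions by 2."""
    pyramid = [np.asarray(image, dtype=float)]
    for _ in range(levels - 1):
        pyramid.append(blur(pyramid[-1])[::2, ::2])
    return pyramid
```

Processing would then start at `pyramid[-1]`, the coarsest level, and move toward `pyramid[0]`.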

4.1 Face finding at coarse resolution 

For our test images, face detection is not a major prob- 
lem since the subjects are shot against a uniform back- 
ground. For the more general case of cluttered back- 
grounds, see the face detection work of Reisfeld and 
Yeshurun [32], Ben-Arie and Rao [6], Sung and Pog- 
gio [35], Sinha [34], and Moghaddam and Pentland [25]. 
For our test images, we found that normalized correlation
using two face templates works well. The normalized
correlation metric is

$$\frac{\langle T I \rangle - \langle T \rangle \langle I \rangle}{\sigma(T)\,\sigma(I)},$$

where $T$ is the template, $I$ is the subportion of the image being
matched against, $\langle T I \rangle$ is normal correlation, $\langle \cdot \rangle$
is the mean operator, and $\sigma(\cdot)$ measures standard deviation.
The templates are formed by averaging face grey
levels over two populations, an average of all examples 
plus an average over people with beards. Before aver- 
aging, example face images are first warped to standard

Figure 10: Face finding templates are grey level averages
using two populations, all examples (left) plus people 
with beards (right). 



shape. Our two face templates for a frontal pose are 
shown in Fig. 10. To provide some invariance to scale, 
regions with high correlation response to these templates 
are examined with secondary correlations where the scale 
parameter is both increased and decreased by 20%. The 
location/scale of correlation matches above a certain 
threshold are reported to the vectorizer. 
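
The normalized correlation metric can be written directly from its definition. The sketch below assumes NumPy and a template and image patch of equal size; the function name is illustrative, and a constant patch (zero standard deviation) would need separate handling.

```python
import numpy as np

def normalized_correlation(T, I):
    """(<TI> - <T><I>) / (sigma(T) * sigma(I)) for equally sized arrays."""
    T = np.asarray(T, dtype=float)
    I = np.asarray(I, dtype=float)
    numerator = (T * I).mean() - T.mean() * I.mean()
    return numerator / (T.std() * I.std())
```

The score is invariant to a positive gain and an offset in grey level, so a patch equal to `2 * T + 3` scores the same as `T` itself.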

4.2 Multiple templates at high resolution 

When processing the different pyramid levels, we use a 
whole face template at the two coarser resolutions and 
templates around the eyes, nose, and mouth for the two 
finer resolutions. This template decomposition across 
scales is similar to Burt's pattern tree approach [13] for 
template matching on a pyramid representation. At a 
coarse scale, faces are small, so full face templates are 
needed to provide enough spatial support for texture 
analysis. At a finer scale, however, individual features - 
eyes, noses - cover enough area to provide spatial sup- 
port for analysis, giving us the option to perform sep- 
arate vectorizations. The advantage of decoupling the 
analysis of the eyes, nose, and mouth is that it should 
improve generalization to new faces not in the original 
example set. For example, if the eyes of a new face use 
one set of linear texture coefficients and the nose uses 
another, separate vectorization for the eyes and nose 
provides the extra flexibility we need. However, if new 
inputs always come from people in the original example 
set, then this extra flexibility is not required and keeping 
to whole-face templates should be a helpful constraint. 
When vectorizing separate eyes, nose, and mouth tem- 
plates at the finer two resolutions, the template of the 
eyes has a special status for determining the scale and 
image-plane rotation of the face. The eyes template is 
vectorized first, using 2 iris features as anchor points for 
the similarity transform P. Thus, the eyes vectoriza- 
tion estimates a normalizing similarity transform for the 
face. The scale and rotation parameters are then fixed 
for the nose and mouth vectorizations. Only one anchor 
point is used for the nose and mouth, allowing only the 
translation in P to change. 

4.3 Example results 

For the example case in Fig. 11, correspondences from 
the shape component are plotted over the four levels 
of the Gaussian pyramid. These segment features are 
generated by mapping the averaged line segments from 
Fig. 2 to the input image. To get a sense of the final
shape/texture representation computed at the highest
resolution, Fig. 12 displays the final output for the
Fig. 11 example. For the eyes, nose and mouth templates,
we show $i'_a$, the geometrically normalized templates
$t_a$, and the reconstruction of those templates $\hat{t}_a$
using the linear texture coefficients. No images of this
person were used among the examples used to create the 
eigenspaces. 

We have implemented the hierarchical vectorizer in C 
on an SGI Indy R4600 based machine. Once the example 
images are loaded, multilevel processing takes just a few 
seconds to execute. 

Experimental results presented in the next section on 
applications will provide a more thorough analysis of the 
vectorizer. 

5 Applications 

Once the vectorized representation has been computed, 
how can one use it? The linear texture coefficients can be 
used as a low-dimensional feature vector for face recog- 
nition, which is the familiar eigenimage approach to face 
recognition [37] [2] [26]. Our application of the vectorizer, 
however, has focused on using the correspondences in the 
shape component. In this section we describe experimen- 
tal results from applying these correspondences to two 
problems, locating facial features and the registration of 
two arbitrary faces. 

5.1 Feature finding 

After vectorizing an input image $i_a$, pixelwise correspondence
in the shape component $y^{std}_{a-std}$ provides a dense
mapping from the standard shape to the image $i_a$. Even
though this dense mapping does more than locate just
a sparse set of features, we can sample the mapping to
locate a discrete set of feature points in $i_a$. To accomplish
this, first, during off-line example preparation, the
feature points of interest are located manually with respect
to the standard shape. Then after the run-time
vectorization of $i_a$, the feature points can be located in
$i_a$ by following the pixelwise correspondences and then
mapping under the similarity transform $P$. For a feature
point $q_{std}$ in standard shape, its corresponding location
in $i_a$ is

$$P(q_{std} + y^{std}_{a-std}(q_{std})).$$

For example, the line segment features of Fig. 2 can 
be mapped to the input by mapping each endpoint, as 
shown for the test images in Fig. 13. 

In order to evaluate these segment features located 
by the vectorizer, two vectorizers, one tuned for a frontal 
pose and one for a slightly rotated pose, were each tested 
on separate groups of 62 images. The test set consists 
of 62 people, 2 views per person - a frontal and slightly 
rotated pose - yielding a combined test set of 124 im- 
ages. Example results from the rotated view vectorizer 
are shown in Fig. 14. Because the same views were used 
as example views to construct the vectorizers, a leave- 
6-out cross validation procedure was used to generate 
statistics. That is, the original group of 62 images from a 
given pose were divided into 11 randomly chosen groups 
(10 of 6 people, 1 of the remaining 2 people). Each group

Figure 11: Evolution of the shape component during
coarse-to-fine processing. The shape component is displayed
through segment features which are generated by
mapping the averaged line segments from Fig. 2 to the
input image.

Figure 12: Final vectorization at the original image resolution.

Figure 13: Example features located by sampling the
dense set of shape correspondences $y^{std}_{a-std}$ found by the
vectorizer.



of images is tested using a different vectorizer; the vec- 
torizer for group G is constructed from an example set 
consisting of the original images minus the set G. This 
allows us to separate the people used as examples from 
those in the test set. 
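
The leave-6-out grouping can be sketched as follows, assuming NumPy and integer person indices; the function name and the fixed seed are illustrative choices, not details from the original experiments.

```python
import numpy as np

def leave_k_out_groups(n_people, k, seed=0):
    """Randomly partition person indices into groups of k (last may be smaller)."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(n_people)
    return [order[i:i + k] for i in range(0, n_people, k)]

groups = leave_k_out_groups(62, 6)          # 10 groups of 6 and 1 group of 2
for held_out in groups:
    # The vectorizer for this group is built from everyone not in the group.
    examples = np.setdiff1d(np.arange(62), held_out)
```

Each person appears in exactly one held-out group, so every test image is processed by a vectorizer that never saw that person as an example.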

Qualitatively, the results were very good, with only 
one mouth feature being completely missed by the vec- 
torizer (it was placed between the mouth and nose). To 
quantitatively evaluate the features, we compared the 
computed segment locations against manually located 
"ground truth" segments, the same segments used for 
off-line geometrical normalization. To report statistics 
by feature, the segments in Fig. 2 are grouped into 6
features: left eye ($c_3$, $c_4$, $c_5$, $c_6$), right eye ($c_9$, $c_{10}$, $c_{11}$,
$c_{12}$), left eyebrow ($c_1$, $c_2$), right eyebrow ($c_7$, $c_8$), nose
($n_1$, $n_2$, $n_3$), and mouth ($m_1$, $m_2$).

Figure 14: Example features located by the vectorizer.

Two different metrics were used to evaluate how close
a computed segment came to its corresponding ground
truth segment. Segments in the more richly textured areas
(e.g. eye segments) have local grey level structure at
both endpoints, so we expect both endpoints to be accurately
placed. Thus, the "point" metric measures the
two distances between corresponding segment endpoints.
On the other hand, some segments are more edge-like,
such as eyebrows and mouths. For the "edge" metric
we measure the angle between segments and the perpendicular
distance from the midpoint of the ground truth
segment to the computed segment.

Next, the distances between the manual and com- 
puted segments were thresholded to evaluate the close- 
ness of fit. A feature will be considered properly detected 
when all of its constituent segments are within thresh- 
old. Using a distance threshold of 10% of the interocular 
distance and an angle threshold of 20°, we compute detection
rates and average distances between manual and
computed segments (Table 1). The eyebrow and nose er- 
rors are more of a misalignment of a couple points rather 
than a complete miss (the mouth error was a complete 
miss). 
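
The two segment metrics and the threshold test can be sketched as below. This is an illustration under assumptions: segments are given as pairs of (x, y) endpoints in matching order, the angle is treated as undirected (0° to 90°), and the perpendicular distance is measured to the infinite line through the computed segment. All function names are hypothetical.

```python
import numpy as np

def point_metric(computed, truth):
    """Distances between the two pairs of corresponding endpoints."""
    c = np.asarray(computed, dtype=float)
    g = np.asarray(truth, dtype=float)
    return np.linalg.norm(c - g, axis=1)

def edge_metric(computed, truth):
    """Angle in degrees between segments, and perpendicular distance from
    the ground truth midpoint to the line through the computed segment."""
    c = np.asarray(computed, dtype=float)
    g = np.asarray(truth, dtype=float)
    dc, dg = c[1] - c[0], g[1] - g[0]
    cos_angle = abs(dc @ dg) / (np.linalg.norm(dc) * np.linalg.norm(dg))
    angle = np.degrees(np.arccos(np.clip(cos_angle, 0.0, 1.0)))
    offset = g.mean(axis=0) - c[0]
    cross = dc[0] * offset[1] - dc[1] * offset[0]
    return angle, abs(cross) / np.linalg.norm(dc)

def point_ok(computed, truth, interocular):
    """Both endpoints within 10% of the interocular distance."""
    return bool(np.all(point_metric(computed, truth) <= 0.10 * interocular))

def edge_ok(computed, truth, interocular):
    """Angle within 20 degrees and distance within 10% of interocular distance."""
    angle, dist = edge_metric(computed, truth)
    return bool(angle <= 20.0 and dist <= 0.10 * interocular)
```

A feature would then count as detected only when every one of its constituent segments passes the test appropriate to that feature.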

In the next section we consider another application of 
the shape component computed by the vectorizer. 

5.2 Registration of two arbitrary faces 

Suppose that we have only one view of an individual's 
face and that we would like to synthesize other views, 
perhaps rotated views or views with different expres- 
sions. These new "virtual" views could be used, for ex- 
ample, to create an animation of the individual's face 
from just one view. For the task of face recognition, vir- 
tual views could be used as multiple example views in 
a view-based recognizer. In this section, we discuss how 
the shape component from the vectorizer can be used to 
synthesize virtual views. In addition, these virtual views 
are then evaluated by plugging them into a view-based, 
pose-invariant face recognizer. 

To synthesize virtual views, we need to have prior

Figure 15: In parallel deformation, (a) a 2D deformation
representing a transformation is measured by finding cor- 
respondence among prototype images. In this example, 
the transformation is rotation and optical flow was used 
to find a dense set of correspondences. Next, in (b), the 
flow is mapped onto the novel face, and (c) the novel 
face is 2D warped to a "virtual" view. Figure from [11]. 



knowledge of a facial transformation such as head rota- 
tion or expression change. A standard approach used in 
the computer graphics and computer vision communities 
for representing this prior knowledge is to use a 3D model 
of the face (Akimoto, Suenaga, and Wallace [3], Waters
and Terzopoulos [36][39], Aizawa, Harashima, and
Saito [1], Essa and Pentland [20]). After the single available
2D image is texture mapped onto a 3D polygonal
or multilayer mesh model of the face, rotated views
can be synthesized by rotating the 3D model and ren- 
dering. In addition, facial expressions have been mod- 
eled [36] [39] [20] by embedding muscle forces that deform 
the 3D model in a way that mimics human facial mus- 
cles. Mapping image data onto the 3D model is typ- 
ically solved by locating corresponding points on both 
the 3D model and the image or by simultaneously ac- 
quiring both the 3D depth and image data using the 
Cyberware scanner. 

We have investigated an alternative approach that 
uses example 2D views of prototype faces as a substi- 
tute for 3D models (Poggio and Vetter [30], Poggio and 
Brunelli [29], Beymer and Poggio [11]). In parallel defor- 
mation, one of the example-based techniques discussed 
in Beymer and Poggio [11], prior knowledge of a facial 
transformation such as a rotation or change in expression 
is extracted from views of a prototype face undergoing 
the transformation. Shown in Fig. 15, first a 2D de- 
formation representing the transformation is measured 



                                    average distances
                                    point metric    edge metric
feature         detection rate      endpt. dist.    angle        perpend. dist.
                                    (pixels)        (degrees)    (pixels)
left eye        100% (124/124)      1.24            -            -
right eye       100% (124/124)      1.23            -            -
left eyebrow    97% (121/124)       -               5.1°         1.06
right eyebrow   96% (119/124)       -               4.8°         1.06
nose            99% (123/124)       1.45            3.2°         0.66
mouth           99% (123/124)       -               2.2°         0.53

Table 1: Detection rates and average distances between computed and "ground truth" segments. Qualitatively, the
eyebrow and nose errors were misalignments, while the mouth error did involve a complete miss.

Figure 16: Example pairs of real and virtual views.



by finding correspondence between the prototype face 
images. We use the same gradient-based optical flow al- 
gorithm [9] used in the vectorizer to find a dense set of 
pixelwise correspondences. Next, the prototype flow is 
mapped onto the "novel" face, the individual for which 
we wish to generate virtual views. This step requires "in- 
terperson" correspondence between the prototype and 
novel faces. Finally, the prototype flow, now mapped 
onto the novel face, can be used to 2D warp the novel 
face to produce the virtual view. 

The difficult part of parallel deformation is automat- 
ically finding a set of feature correspondences between 
the prototype and novel faces. We have used the vec- 
torizer to automatically locate the set of facial features 
shown in Fig. 14 in both the prototype and novel faces. 
From this sparse set of correspondences, the interpola- 
tion technique from Beier and Neely [5] is used to gen- 
erate a dense, pixelwise mapping between the two faces. 
We then used the dense set of correspondences to map 
rotation deformations from a single prototype to a group 
of 61 other faces for generating virtual views. Fig. 16 
shows some example pairs of real and virtual views. 

To evaluate these virtual views, they were used as 



example views in a view-based, pose-invariant face rec- 
ognizer (see [11] for details). The problem is this: given 
one real view of each person, can we recognize the per- 
son under a variety of poses? Virtual views were used to 
generate a set of rotated example views to augment the 
single real view. Using a simple view-based approach 
that represents faces with templates of the eyes, nose, 
and mouth, we were able to get a recognition rate of 
85% on a test set of 620 images (62 people, 10 views per 
person). To put this number in context, consider the 
recognition results from a "base" case of two views per 
person (the single real view plus its mirror reflection) and 
a "best" case of 15 real views per person. When tested 
on the same test set, we obtained recognition rates of 
70% for the two views case and 98% for the 15 views 
case. Thus, adding virtual views to the recognizer increases
the recognition rate by 15 percentage points over the base
case, and the performance of virtual views is about midway
between the base and best case scenarios (15 of the 28
points separating them).

6 Future work 

In this section, first we discuss some shorter-term work 
for the existing vectorizer. This is followed by longer- 
term ideas for extending the vectorizer to use parame- 
terized shape models and to handle multiple poses. 

6.1 Existing vectorizer 

So far the vectorizer has been tested on face images shot 
against a solid background. It would be nice to demon- 
strate the vectorizer working in cluttered environments. 
To accomplish this, both the face detection and vectorizer
should be made more robust to the presence of false
positive matches. To improve face detection, we would
probably incorporate the learning approaches of Sung
and Poggio [35] or Moghaddam and Pentland [25]. Both
of these techniques model the space of grey level face
images using principal components analysis. To judge
the "faceness" of an image, they use a distance metric
that includes two terms, "distance from face space" (see
Turk and Pentland [37])

$$\| t_a - \hat{t}_a \|$$

and the Mahalanobis distance

$$\sum_{i=1}^{N} \frac{\beta_i^2}{\lambda_i},$$

where the $\beta_i$ are the eigenspace projection coefficients
and $\lambda_i$ are the eigenvalues from principal component
analysis. This distance metric could be added to the
vectorizer as a threshold test after the iteration step has
converged.
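
The two terms of this distance metric can be sketched as below, assuming NumPy, orthonormal eigenimages stored as the columns of `E`, and eigenvalues `lam` in matching order. The function name `faceness_terms` is hypothetical, and how the two terms are weighted into a single score is left open here.

```python
import numpy as np

def faceness_terms(t_a, E, lam, t_mean):
    """Return (distance from face space, Mahalanobis distance in face space)."""
    beta = E.T @ (t_a - t_mean)              # eigenspace projection coefficients
    t_hat = t_mean + E @ beta                # reconstruction inside face space
    dffs = np.linalg.norm(t_a - t_hat)       # || t_a - t_hat ||
    mahalanobis = np.sum(beta ** 2 / lam)    # sum_i beta_i^2 / lambda_i
    return dffs, mahalanobis

# Tiny worked example with a 2-dimensional face space inside 4 dimensions.
E = np.eye(4)[:, :2]
lam = np.array([4.0, 1.0])
dffs, mahalanobis = faceness_terms(np.array([2.0, 1.0, 3.0, 0.0]), E, lam, np.zeros(4))
```

In the worked example the coefficients are (2, 1), so the Mahalanobis term is 4/4 + 1/1 = 2, and the only component outside the face space has magnitude 3.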

Our current coarse-to-fine implementation does not 
exploit potential constraints that could be passed from 
the coarser to finer scales. The only information cur- 
rently passed from a coarse level to the next finer level 
are feature locations used to initialize the similarity 
transform P. This could be expanded to help initial- 
ize the shape and texture components at the finer level 
as well. 

6.2 Parameterized shape model 

In the current vectorizer, shape is measured in a "data- 
driven" manner using optical flow. However, we can explicitly
model shape by taking a linear combination of
example shapes

$$y^{std}_{a-std} = \sum_{i=1}^{n} \alpha_i\, y^{std}_{p_i-std},$$

where the shape of the $i$th example image, $y^{std}_{p_i-std}$, is the
2D warping used to geometrically normalize the image
in the off-line preparation step. This technique for modeling
shape is similar to the work of Cootes, et al. [17],
Blake and Isard [12], Baumberg and Hogg [4], and Jones
and Poggio [21]. The new shape step would, given $i'_a$
and reference $\hat{t}_a$, try to find a set of coefficients $\alpha_i$ that
minimizes the squared error of the approximation

$$i'_a\Big(\mathbf{x} + \sum_{i=1}^{n} \alpha_i\, y^{std}_{p_i-std}(\mathbf{x})\Big) = \hat{t}_a(\mathbf{x}).$$

This involves replacing the optical flow calculation with a 
model-based matching procedure; one can think of it as a 
parameterized "optical flow" calculation that computes 
a single set of linear coefficients instead of a flow vector 
at each point. One advantage of modeling shape is the 
extra constraint it provides, as some "illegal" warpings 
cannot even be represented. Additionally, compared to 
the raw flow, the linear shape coefficients should be more 
amenable for shape analysis tasks like expression analysis 
or face recognition using shape. 

Given this new model for shape in the vectorizer, the
set of $\alpha$ shape coefficients and $\beta$ texture coefficients could
be used as a low-dimensional representation for faces.
An obvious application of this would be face recognition.
Even without the modified vectorizer and the $\alpha$
coefficients, the $\beta$ coefficients alone could be evaluated
as a representation for a face recognizer.

6.3 Multiple poses 

The straightforward way to handle different out-of-plane 
image rotations with the vectorizer is simply to use sev- 
eral vectorizers, each tuned to a different pose. However, 
if we provide pixelwise correspondence between the stan- 
dard shapes of the different vectorizers, their operations 
can be linked together through image interpolation. The 
main idea is to interpolate among the $t_a$ images of the
different vectorizers to produce a new image that reconstructs
both the grey levels and the pose of the input image
(see Beymer, Shashua and Poggio [10] for examples
of interpolation across different poses). Correspondence
is then found between the input and this new interpo- 
lated image using optical flow. This correspondence, in 
turn, gives us correspondence between the input and the 
individual vectorizers, so the input can be warped to 
each one for a combined textural analysis. This proce- 
dure requires adding pose to the existing state variables 
of shape, texture, and similarity transform P. The out- 
put of this multi-pose vectorizer would be useful for pose 
estimation and pose-invariant face recognition. 

7 Conclusion 

In this paper, we first introduced a vectorized image rep- 
resentation, a feature-based representation where corre- 
spondence has been established with respect to a refer- 
ence image. Two image measurements are made at the 
feature points. First, feature geometry, or shape, is rep- 
resented by the (x,y) feature locations relative to the 
standard face shape. Second, grey levels, or texture, are
represented by mapping image grey levels onto the standard
face shape. Given this definition, the primary focus of
this paper has been to explore an automatic technique for
computing this vectorized representation for face images.

To design an algorithm for vectorizing images, or a
"vectorizer", we observed that the two representations
can be linked. That is, for textural analysis, the shape 
component can be used to geometrically normalize an 
image so that features are properly aligned. Conversely, 
for shape analysis, the textural analysis can be used to 
create a reference image that reconstructs a geometri- 
cally normalized version of the input. We can then com- 
pute shape by finding correspondence between the refer- 
ence image, which is at standard shape, and the input. 
The main idea of our vectorizer is to exploit the nat- 
ural feedback between the texture and shape computa- 
tions by iterating back and forth between the two until 
the shape/texture representation converges. We have 
demonstrated an efficient implementation of the vector- 
izer using a hierarchical coarse-to-fine strategy. 

Two applications of the shape component were explored:
facial feature finding and the registration of two
faces. In our feature finding experiments, eyes, nose,
mouth, and eyebrow features were located in 124 test 
images of 62 people at two different poses, and only one 
mouth feature was missed by the system. In the sec- 
ond application, one wants to generate new views of a 
"novel" face given just one view. Prior knowledge of a 
facial transformation such as a rotation is represented 
by 2D example images of a "prototype" face undergoing 
the transformation. The problem here is to register the 
"novel" face with a prototype face. We showed how to 
perform this registration step using features located by 
the vectorized shape component. 